# -*- mode: org; fill-column: 80; -*-
#+TITLE: Chapter-01
#+AUTHOR: Zelphir Kaltstahl
#+STARTUP: content indent align inlineimages hideblocks entitiesplain nologdone nologreschedule nologredeadline nologrefile
#+TODO: TODO INPROGRESS | DONE
#+DATE: <2021-02-13 Sa>
#+LANGUAGE: English
#+PRIORITIES: A E E

* Introduction

This document contains notes about the book "Programming in Standard ML" by
Robert Harper.

* Keywords

** val

The keyword ~val~ introduces a new binding:

#+begin_src sml :results output drawer
val something = 3;
something;
#+end_src

#+RESULTS:
:results:
val it = 3 : int
:end:

* A regular expression package

** Specification / Type Declaration

What follows is a definition of signatures of a regular expression package, as
described in the book.

#+begin_src sml :results output drawer
signature REGEXP = sig
    datatype regexp =
             Zero | One | Char of char |
             Plus of regexp * regexp |
             Times of regexp * regexp |
             Star of regexp
    exception SyntaxError of string
    val parse : string -> regexp
    val format : regexp -> string
end

signature MATCHER = sig
    structure RegExp : REGEXP
    val accepts : RegExp.regexp -> string -> bool
end
#+end_src

#+RESULTS:
:results:
signature REGEXP =
  sig
    datatype regexp
      = Char of char
      | One                      |
      | Plus of regexp * regexp  |
      | Star of regexp           |
      | Times of regexp * regexp |
      | Zero                     |
    exception SyntaxError of string
    val parse : string -> regexp
    val format : regexp -> string
  end
signature MATCHER =
  sig
    structure RegExp :
      sig
        datatype regexp
          = Char of char
          | One                      |
          | Plus of regexp * regexp  |
          | Star of regexp           |
          | Times of regexp * regexp |
          | Zero                     |
        exception SyntaxError of string
        val parse : string -> regexp
        val format : regexp -> string
      end
    val accepts : RegExp.regexp -> string -> bool
  end
:end:

What can we get from the written definitions and the text in the book?

- ~signature~ is used to describe modules. Apparently this is how one describes
  the types inside modules in SML.

- ~datatype~ is used to define a type ~regexp~ (regular expression), which is
  one of the stated alternatives.

- Alternatives for ~datatype~ are assigned using the equal sign ~=~. The idea
  is, that the the datatype really is the same as the union of all the listed
  types.

- Alternatives for ~datatype~ are separated by pipes ~|~.

- ~val~ is used to introduce binding definitions inside module descriptions as
  well.

- Types of function bindings are specified using the colon ~:~ after the binding
  name. The functions are not equal to the types, which is probably why the ~=~
  sign is not used.

- SML has the concept of exceptions.

- SML seems to start definitions like signature definitions with a keyword, in
  this example ~sig~ and end them using the keyword ~end~. (But why is this not
  used for ~structure~? Probably, because ~structure~ only allows one thing to
  follow after it and that is why the expression is unambigious. Or perhaps the
  difference is made between block expressions like ~sig~ - ~end~ and
  statements.

- Function definitions use the arrow ~->~ to separate types of input arguments
  and types of result values. If a function takes multiple arguments, it is
  probably curried and therefore written with multiple arrows ~->~. For example
  the function ~accepts~ takes a ~RegExp.regexp~ and a ~string~ and returns a
  ~bool~, but perhaps it is curried and works by first taking a ~RegExp.regexp~
  and returns a function, which takes a ~string~ and that function then returns
  the ~bool~.

- There seems to be a convention of naming signatures in all capital letters.

- ~MATCHER~ depends on ~REGEXP~.

- Signatures are descriptions of what needs to be implemented.

- Structures implement signatures.

- In SMLNJ the dot ~.~ is used to access structure components. For example
  ~MyStructure.TheComponent~.

The book goes on to show how one would write structure declarations:

#+begin_src sml :results output drawer
structure RegExp :> REGEXP = ...
structure Matcher :> MATCHER = ...
#+end_src

From this we can gather, that apparently to implement a signature using a
structure, one writes not ~:~, but ~:>~ instead and then declare the structure
to be equal to the actual implementation. The actual implementation needs to
conform to the signature ~REGEXP~. This is type checked. SML also type checks,
that the implementation does not make use of anything, that is not specified in
the signature.

The book shows then a usage example:

#+begin_src sml :results output drawer
val regexp = Matcher.RegExp.parse "(a+b)*"
val matcher = Matcher.accepts regexp  (* currying in action, returning a function *)
val ex1 = matches "aabba"  (* true *)
val ex2 = matches "abac"  (* false *)
#+end_src

One should use the /long identifier/ ~Matcher.RegExp.parse~ instead of
~RegExp.parse~, since SML makes a difference between the two and there can be
issues with using ~RegExp.parse~ instead. The book calls this /sharing/, as the
implementation of ~RegExp.parse~, if used directly, could be used in multiple
contexts in the code. However, always writing the long identifier, the full path
to the function or member could become unreadable and typing it annoying. There
are ways to alleviate this problem:

#+begin_src sml :results output drawer
structure M = Matcher
structure R = M.RegExp  (* can also use dots here *)
val regexp = R.parse "((a + %).(b + %))*"
val matches = M.accepts regexp
val ex1 = matches "aabba"
val ex2 = matches "abac"
#+end_src

#+RESULTS:
:results:
stdIn:36.15-36.22 Error: unbound structure: Matcher
:end:

** Implementation

The book shows an overview of an implementation for the regular expression
package:

#+begin_src sml :results output drawer
structure RegExp :> REGEXP = struct
    datatype regexp =
             Zero | One | Char of char |
             Plus of regexp * regexp |
             Times of regexp * regexp |
             Star of regexp
    fun tokenize s = ...
    fun parse s =
        let
            val (r, s') =
                parse_rexp (tokenize (String.explode s))
        in
            case s' of
                nil => r
            | _ => raise SyntaxError "Bad input."
        end
        handle LexicalError =>
               raise SyntaxError "Bad input."
        ...
    fun format r =
        String.implode (format_exp r)
end
#+end_src

What can we understand from the above code?

- Structures:

  - Structures, which are signature implementations, are wrapped by ~struct~ and
    ~end~. This might be useful, when defining a structure inline or simply to
    avoid giving whitespace or indentation meaning.

- Let:

  - There is a syntax form ~let~-~in~-~end~, which creates bindings for use within
    its ~in~ block.

- Exception handling:

  - The keyword ~raise~ is used to raise exceptions.

  - The keyword ~handle~ is used to specify, how to handle an exception from a
    lower layer.

- Functions:

  - The keyword ~fun~ is used to indicate implementations of functions. After
    ~fun~ follows the function name.

  - Functions can return multiple values (perhaps only as list or tuple) and those
    can be destructured by using ~val~ and a tuple of binding names.

  - After the binding name follow the arguments names of a function.

  - After the argument names of a function follows the equal sign ~=~ to
    indicate, that the function is equal to its implementation.

  - Functions apparently do not end using the ~end~ keyword. This is
    surprising. Perhaps the ~end~ keyword merely needs not to be written in such
    a definition. How would one write an anonymous function inline, if it is not
    delimited using any keyword or special character?

- Cases:

  - There is a keyword ~case~ for distinguishing multiple cases and the cases are
    separated using ~|~.

  - Perhaps the underscore ~_~ stands for a case, in which the value of the
    binding is irrelevant.

  - The fat arrow ~=>~ is used to separate cases from their consequences.

- Strings:

  - Strings are written in double quotes.

  - There seem to be static methods of ~String~, for example ~String.explode~ and
    ~String.implode~.

*** Implementation of tokenize

#+begin_src sml :results output drawer
datatype token =
         AtSign | Percent | Literal of char |
         PlusSign | TimesSign |
         Asterisk | LParen | RParen
exception LexicalError
fun tokenize nil = nil
    | tokenize (#"+" :: cs) = PlusSign :: tokenize cs
    | tokenize (#"." :: cs) = TimesSign :: tokenize cs
    | tokenize (#"*" :: cs) = Asterisk :: tokenize cs
    | tokenize (#"(" :: cs) = LParen :: tokenize cs
    | tokenize (#")" :: cs) = RParen :: tokenize cs
    | tokenize (#"@" :: cs) = AtSign :: tokenize cs
    | tokenize (#"%" :: cs) = Percent :: tokenize cs
    | tokenize (#"\\" :: c :: cs) = Literal c :: tokenize cs
    | tokenize (#"\\" :: nil) = raise LexicalError
    | tokenize (#" " :: cs) = tokenize cs
    | tokenize (c :: cs) = Literal c :: tokenize cs
#+end_src

#+RESULTS:
:results:
datatype token
  = Asterisk
  | AtSign          |
  | LParen          |
  | Literal of char |
  | Percent         |
  | PlusSign        |
  | RParen          |
  | TimesSign       |
exception LexicalError
val tokenize = fn : char list -> token list
:end:

What can we understand from the above code?

- Characters are written as a string with a ~#~ in front.
- ~::~ is used as an operator for consing onto lists.
- SML can do recursive function calls.
- Pattern matching can use destructuring for lists.
- Exceptions are specified above functions, which might raise them.
- ~nil~ is the list terminating thing and probably the empty list.
- Backslashes in strings or as characters need to be escaped.
- The book mentions a few concepts and associates them with the characters,
  which are used in regular expressions to express the concepts:
  - Alternation: ~+~: choose between things
  - Concatenation: ~.~: put things together
  - Empty regular expression: ~@~: ???
  - Null string: ~%~: ???
  - Iteration: ~*~: ???
- Lists of things are displayed as ~TYPE list~.

The book goes on giving an unexplained dump of abbreviations in a grammar for
regular expressions, which is then implemented. It seems difficult to understand
it, because the abbreviations are not explained. Was it really too much work, to
write out what the abbreviations stand for? The grammar is the following:

#+begin_quote
rexp :: = rtrm | rtrm+rexp
rtrm :: = rfac | rfac.rtrm
rfac :: = ratm | ratm*
ratm :: = @ | % | a | (rexp)
#+end_quote

What I can guess is:

- ~rexp~: stands for regular expression
- ~rtrm~: stands for terminal symbol, perhaps
- ~ratm~: stands for atom, perhaps

I have no idea what ~rfac~ stands for. "Factor"? But what sense does that make?

The code, that follows the grammar is not much clearer to me.

For this reason I skip the rest of the regular expression package implementation
and go on with the rest of the book.

* Test lab

#+begin_src sml :results output
3;
3 + 4;
4 div 3;
4 mod 3;
#+end_src

#+RESULTS:
: val it = 7 : int
